38 research outputs found

    Block-diagonal covariance selection for high-dimensional Gaussian graphical models

    Gaussian graphical models are widely used to infer and visualize networks of dependencies between continuous variables. However, inferring the graph is difficult when the sample size is small compared to the number of variables. To reduce the number of parameters to estimate in the model, we propose a non-asymptotic model selection procedure supported by strong theoretical guarantees based on an oracle inequality and a minimax lower bound. The covariance matrix of the model is approximated by a block-diagonal matrix. The structure of this matrix is detected by thresholding the sample covariance matrix, where the threshold is selected using the slope heuristic. Based on the block-diagonal structure of the covariance matrix, the estimation problem is divided into several independent problems: subsequently, the network of dependencies between variables is inferred using the graphical lasso algorithm in each block. The performance of the procedure is illustrated on simulated data. An application to a real gene expression dataset with a limited sample size is also presented: the dimension reduction allows attention to be objectively focused on interactions among smaller subsets of genes, leading to a more parsimonious and interpretable modular network. Comment: Accepted in JAS
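
    The block-detection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the threshold is already given (the abstract selects it with the slope heuristic, which is not shown), and the function name `detect_blocks` and the toy matrix `sample_cov` are hypothetical. Variables are grouped into one block whenever a chain of above-threshold covariance entries connects them.

```python
def detect_blocks(cov, threshold):
    """Group variables into blocks: i and j share a block when they are
    linked by a chain of entries with |cov[i][j]| > threshold."""
    p = len(cov)
    parent = list(range(p))  # union-find forest, one tree per block

    def find(i):
        # Walk up to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the blocks of every pair whose covariance exceeds the threshold.
    for i in range(p):
        for j in range(i + 1, p):
            if abs(cov[i][j]) > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Collect the variable indices of each block.
    blocks = {}
    for i in range(p):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())


# Hypothetical 4-variable sample covariance, for illustration only.
sample_cov = [
    [1.0, 0.8, 0.05, 0.0],
    [0.8, 1.0, 0.0, 0.02],
    [0.05, 0.0, 1.0, -0.6],
    [0.0, 0.02, -0.6, 1.0],
]
blocks = detect_blocks(sample_cov, threshold=0.1)
print(blocks)  # [[0, 1], [2, 3]]
```

    Each detected block could then be handed to a separate graphical lasso fit, as the abstract describes; that step is omitted here to keep the sketch dependency-free.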

    Transformation des données et comparaison de modèles pour la classification des données RNA-seq (Data transformation and model comparison for clustering RNA-seq data)

    High-throughput transcriptome sequencing data (RNA-seq) are made up of highly heterogeneous counts. Although they are often modeled with discrete distributions, including the Poisson and negative binomial distributions, Gaussian models on transformed data could alternatively be considered. We show how the likelihood of these different models can be objectively compared. We focus attention on the problem of clustering gene profiles, where Poisson mixtures on count data are compared with Gaussian mixtures on transformed data.

    A model selection criterion for model-based clustering of annotated gene expression data

    In co-expression analyses of gene expression data, it is often of interest to interpret clusters of co-expressed genes with respect to a set of external information, such as a potentially incomplete list of functional properties for which a subset of genes may be annotated. Based on the framework of finite mixture models, we propose a model selection criterion that takes into account such external gene annotations, providing an efficient tool for selecting a relevant number of clusters and clustering model. This criterion, called the integrated completed annotated likelihood (ICAL), is defined by adding an entropy term to a penalized likelihood to measure the concordance between a clustering partition and the external annotation information. The ICAL leads to the choice of a model that is more easily interpretable with respect to the known functional gene annotations. We illustrate the interest of this model selection criterion in conjunction with Gaussian mixture models on simulated gene expression data and on real RNA-seq data.

    A comprehensive review of variable selection in high-dimensional regression for molecular biology

    15 pages, 5 tables. Variable selection methods are widely used in molecular biology to detect biomarkers or to infer gene regulatory networks from transcriptomic data. Methods are mainly based on the high-dimensional Gaussian linear regression model and we focus on this framework for this review. We propose a comparison study of variable selection procedures from regularization paths by considering three simulation settings. In the first one, the variables are independent, allowing the evaluation of the methods in the theoretical framework used to develop them. In the second setting, two structures of the correlation between variables are considered to evaluate how the biological dependencies usually observed affect the estimation. Finally, the third setting mimics the biological complexity of transcription factor regulation; it is the setting farthest from the Gaussian framework. In all the settings, the prediction capacity and the identification of the explanatory variables are evaluated for each method. Our results show that variable selection procedures rely on statistical assumptions that should be carefully checked. The Gaussian assumption and the number of explanatory variables are the two key points. As soon as correlation exists, the regularization function Elastic-net provides better results than Lasso. LinSelect, a non-asymptotic model selection method, should be preferred to the commonly used eBIC criterion. Bolasso is a judicious strategy to limit the selection of non-explanatory variables.

    Data-based filtering for replicated high-throughput transcriptome sequencing experiments

    Supplementary data are available at Bioinformatics online. RNA sequencing is now widely performed to study differential expression among experimental conditions. As tests are performed on a large number of genes, very stringent false discovery rate control is required at the expense of detection power. Ad hoc filtering techniques are regularly used to moderate this correction by removing genes with low signal, with little attention paid to their impact on downstream analyses.

    Working with omics data, an interdisciplinary challenge at the crossroads of biology and computer science

    Nowadays, generating omics data is a common activity for laboratories in biology. Experimental protocols to prepare biological samples are well described, and technical platforms to generate omics data from these samples are available in most research institutes. Furthermore, manufacturers constantly propose technical improvements, simultaneously decreasing the cost of experiments and increasing the amount of omics data obtained in a single experiment. In this context, biologists are facing the challenge of dealing with large omics datasets, also called "big data" or "data deluge". Working with omics data raises issues usually handled by computer scientists and thus cooperation between biologists and computer scientists has become essential to efficiently study cellular mechanisms in their entirety, as omics data promise. In this chapter, we define omics data, explain how they are produced, and finally, present some of their applications in fundamental and medical research.

    Benchmark of Differential Gene Expression Analysis Methods for Inter-species RNA-Seq Data using a Phylogenetic Simulation Framework

    Inter-species RNA-Seq datasets are increasingly common, and have the potential to answer new questions on gene expression patterns across evolution. Single-species differential expression analysis is now a well-studied problem that benefits from sound statistical methods. Extensive reviews on biological or synthetic datasets have provided the community with a clear picture of the relative performance of the available tools in various settings. Such benchmarks are still missing in the inter-species gene expression context. In this work, we take a first step in this direction by developing and implementing a new simulation framework. This tool builds on both the RNA-Seq and the Phylogenetic Comparative Methods literatures to generate realistic count datasets, while taking into account the phylogenetic relationships between the samples. We illustrate the features of this new framework through a targeted simulation study that reveals some of the strengths and weaknesses of both the classical and phylogenetic approaches for inter-species differential expression analysis. The tool has been integrated into the R package compcodeR, freely available on Bioconductor.